The given dataset contains information on purchases made through the purchase card programs administered by the state and higher education institutions. All transactions included in this dataset were online transactions. In this report, we summarise the observations and trends within the dataset and create features that can aid in building a model to identify fraudulent transactions, or anomalies.
#Loading required libraries and data for analysis
library(DT) #For easy viewing and interaction of dataframes
library(dplyr) #For data manipulation and wrangling
library(ggplot2) #For visualizations and comprehensive plotting
library(plotly) #For interactive and beautiful graphs
library(gridExtra); library(cowplot) #For better graph grids
ccd <- read.csv("purchase_credit_card.csv")
str(ccd) #To understand the data
## 'data.frame': 442458 obs. of 11 variables:
## $ Year.Month : int 201307 201307 201307 201307 201307 201307 201307 201307 201307 201307 ...
## $ Agency.Number : int 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
## $ Agency.Name : chr "OKLAHOMA STATE UNIVERSITY" "OKLAHOMA STATE UNIVERSITY" "OKLAHOMA STATE UNIVERSITY" "OKLAHOMA STATE UNIVERSITY" ...
## $ Cardholder.Last.Name : chr "Mason" "Mason" "Massey" "Massey" ...
## $ Cardholder.First.Initial : chr "C" "C" "J" "T" ...
## $ Description : chr "GENERAL PURCHASE" "ROOM CHARGES" "GENERAL PURCHASE" "GENERAL PURCHASE" ...
## $ Amount : num 890 369 165.8 96.4 126 ...
## $ Vendor : chr "NACAS" "SHERATON HOTEL" "SEARS.COM 9300" "WAL-MART #0137" ...
## $ Transaction.Date : chr "07/30/2013 12:00:00 AM" "07/30/2013 12:00:00 AM" "07/29/2013 12:00:00 AM" "07/30/2013 12:00:00 AM" ...
## $ Posted.Date : chr "07/31/2013 12:00:00 AM" "07/31/2013 12:00:00 AM" "07/31/2013 12:00:00 AM" "07/31/2013 12:00:00 AM" ...
## $ Merchant.Category.Code..MCC.: chr "CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS" "SHERATON" "DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE CLASSIFIED" "GROCERY STORES,AND SUPERMARKETS" ...
The data contains 442,458 observations, or transactions. For each transaction, we have 11 different variables providing information about the transaction, as listed in the str() output above.
In the next step, we change the variable names to more standard R names, which makes it easier for us to code with the variables. We also fix data types for the following columns to enable easier data wrangling: year_month, agency_number, vendor and MCC were converted into factors, while transaction_date and posted_date were converted into Date format.
colnames(ccd) <- c("year_month", "agency_number", "agency_name", "ch_last_name",
"ch_first_initial", "description", 'amount', 'vendor',
'transaction_date', 'posted_date', 'MCC')
#fixed colnames into more standard R formats for easier syntax
#Fixing data types
ccd$year_month <- as.factor(ccd$year_month) #Converting year month into a factor to enable better graphs
ccd$agency_number <- as.factor(ccd$agency_number) #Converting agency number into a factor to enable better EDA
ccd$vendor <- as.factor(ccd$vendor) #Converting vendor into a factor to enable better EDA
ccd$MCC <- as.factor(ccd$MCC) #Converting MCC into a factor to enable better EDA.
ccd$transaction_date <- as.Date(ccd$transaction_date, format = "%m/%d/%Y %I:%M:%S %p")
ccd$posted_date <- as.Date(ccd$posted_date, format = "%m/%d/%Y %I:%M:%S %p")
#Date formats are easier to work with in R
#These changes allow us to identify the following:
unique(ccd$year_month)
## [1] 201307 201308 201309 201310 201311 201312 201401 201402 201403 201404
## [11] 201405 201406
## 12 Levels: 201307 201308 201309 201310 201311 201312 201401 201402 ... 201406
#The data includes transactions from a 12-month period between July 2013 to June 2014.
sprintf("Number of unique agencies: %s", length(unique(ccd$agency_number)))
## [1] "Number of unique agencies: 116"
#116 unique agencies
sprintf("Number of unique vendors: %s", length(unique(ccd$vendor)))
## [1] "Number of unique vendors: 86279"
sprintf("Number of unique Merchant Categories: %s", length(unique(ccd$MCC)))
## [1] "Number of unique Merchant Categories: 435"
#86279 unique vendors who fall into 435 unique categories
As can be seen, the data covers a 12-month period from July 2013 to June 2014. Transactions were made by 116 unique agencies across 86,279 unique vendors, who fall into 435 unique merchant categories.
Let us look at the trends and patterns in the transaction amounts:
quantile(ccd$amount, c(0.001, 0.01, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.999))
## 0.1% 1% 25% 50% 75% 90% 95% 99%
## -1261.482 -166.000 30.910 104.890 345.000 816.780 1487.133 4298.568
## 99.9%
## 21915.972
sprintf("Mean amount is %s", mean(ccd$amount))
## [1] "Mean amount is 424.99117007716"
sprintf("Standard deviation is %s", sd(ccd$amount))
## [1] "Standard deviation is 5266.50910782293"
cat("Range:", range(ccd$amount));
## Range: -42863.04 1903858
#While the transaction amount ranges from -42,863 to 1,903,858.4,
#98% of the transactions lie in the much smaller range of -166 to 4,298.568
#and 99.8% of the transactions lie between -1,261.482 and 21,915.972
The transaction amounts vary widely from -42,863.04 to 1,903,858.37. However, the average transaction amount is only 425 dollars and the standard deviation is 5,267; this puts the upper extreme 362 standard deviations above the mean and the lower extreme 8 standard deviations below it. In addition, negative values are reimbursements for earlier transactions.
When we look at the quantile distributions, we can see that 98.9% of the transactions have a value between -166 and 21,916. In fact, approximately 95% of the transactions lie between 0 and 6,000. Therefore, for most visual examinations we will limit ourselves to transactions with amounts between 0 and 6,000, and for modelling, transactions with values between -1,200 and 22,000 will be included.
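To make the modelling cut-off concrete, here is a minimal sketch of the filter on a toy stand-in for ccd (the amounts below are hypothetical, not from the dataset):

```r
library(dplyr)

#Toy stand-in for ccd with only an amount column (hypothetical values)
ccd_toy <- data.frame(amount = c(-5000, -500, 50, 300, 2500, 50000))

#Keep only transactions inside the approximate 99.8% range used for modelling
model_set <- ccd_toy %>% filter(amount >= -1200, amount <= 22000)

nrow(model_set) #4 of the 6 toy rows fall inside the modelling range
```

On the real data, the same filter would be applied to ccd before fitting a model.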
plot1 <- ccd %>% filter(amount > 0 & amount < 1000) %>%
ggplot(aes(x = year_month, y = amount)) +
geom_boxplot(color = 'skyblue') + theme_minimal() +
xlab("Year & Month") + ylab("Transaction Amount") + ggtitle("Distribution in transaction amount across time")
ggplotly(plot1)
As can be seen, there isn’t much variation in the amount per transaction from one month to another.
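The visual impression from the boxplots can be checked numerically by computing the per-month median; a minimal sketch on a toy frame with the same two columns (values hypothetical):

```r
library(dplyr)

#Toy frame mimicking ccd's year_month and amount columns (hypothetical values)
toy <- data.frame(
  year_month = factor(c("201307", "201307", "201308", "201308")),
  amount = c(100, 300, 120, 280)
)

#A flat series of monthly medians supports the "little variation" reading
monthly_median <- toy %>%
  group_by(year_month) %>%
  summarise(median_amount = median(amount), .groups = "drop")
```

Run against ccd, a near-constant median_amount column would confirm the boxplot reading.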
Now let us look at the number of transactions by different variables. Since each variable has a large number of unique entities, summary statistics offer a much better comparison. However, we can also examine trends within the top segments of each variable.
#Histograms for count of transactions by different variables
plot5 <- ccd %>% group_by(agency_number) %>% mutate(counts = n()) %>%
arrange(desc(counts)) %>% ungroup() %>%
top_frac(0.7) %>% ggplot(aes(x = reorder(agency_name, -counts))) +
geom_bar(fill = 'skyblue', color = 'skyblue') +
theme_minimal() + theme(axis.text.x = element_blank()) + xlab("Agency") +
ylab("Number of Transactions") + ggtitle("Number of transactions in top 70% agencies")
ggplotly(plot5) #Plot to see number of transactions in agencies (top 70%)
plot6 <- ccd %>% group_by(vendor) %>% mutate(counts = n()) %>% arrange(desc(counts)) %>%
ungroup() %>% top_frac(0.2) %>% ggplot(aes(x = reorder(vendor, -counts))) +
geom_bar(fill = 'skyblue', color = 'skyblue') +
theme_minimal() + theme(axis.text.x = element_blank()) + xlab("Vendor") +
ylab("Number of Transactions") + ggtitle("Number of transactions in top 20% vendors")
ggplotly(plot6) #Plot to see number of transactions in vendors (top 20%)
plot7 <- ccd %>% group_by(MCC) %>% mutate(counts = n()) %>%
arrange(desc(counts)) %>% ungroup() %>%
top_frac(0.4) %>% ggplot(aes(x = reorder(MCC, -counts))) +
geom_bar(fill = 'skyblue', color = 'skyblue') +
theme_minimal() + theme(axis.text.x = element_blank()) + xlab("Merchant Category") +
ylab("Number of Transactions") + ggtitle("Number of transactions in top 40% merchant categories")
ggplotly(plot7) #Plot to see number of transactions in merchant categories (top 40%)
plot8 <- ccd %>% ggplot(aes(x = year_month)) +
geom_bar(color = 'skyblue', fill = "skyblue") +
theme_minimal() + xlab("Year & Month") + ylab("Number of Transactions") +
ggtitle("Number of transactions across time")
ggplotly(plot8) #Plot to see number of transactions in each month
Interestingly, 70% of the transactions in this data were initiated by 7 agencies, of which the top 2 account for more than 50% of the transactions by volume. This limits the value we gain from using agency as the most granular unit; however, since the cardholder names are not specific enough to differentiate cardholders well, we will use agency for this dataset.
However, the transactions are more widely distributed amongst the vendors. There are two distinct groups: a small number of vendors have more than twice as many transactions as the rest. This can potentially be used to group vendors into two categories.
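One way to encode that split is a high/low volume flag per vendor. Below is a hedged sketch on toy counts; the threshold of twice the median is an assumption for illustration, not a value derived from the data:

```r
library(dplyr)

#Toy vendor transaction counts (hypothetical values)
vendor_counts <- data.frame(
  vendor = c("A", "B", "C", "D", "E"),
  counts = c(1000, 950, 200, 180, 150)
)

#Vendors with more than twice the median count form the high-volume group
vendor_tiers <- vendor_counts %>%
  mutate(tier = ifelse(counts > 2 * median(counts), "high_volume", "low_volume"))
```

On the real data, the same mutate would run on the per-vendor counts computed for plot6, and the resulting tier could be used as a categorical feature.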
The number of transactions is more naturally distributed across the merchant categories. Across time, the only dip in the number of transactions occurs in December and January, which is understandable since institutional spending is usually lower during the holidays.
As mentioned before, we will use the agency as the primary unit for each transaction. This means we want to identify transaction patterns for each agency and use them to identify anomalies. Let us begin by looking at summary statistics across agencies.
#creating summary statistics for each agency
stat_agency <- ccd %>% group_by(agency_name) %>%
dplyr::summarise(count_agency_all = n(), total_amount_agency_all = sum(amount),
avg_amount_agency_all = mean(amount),
min_amount_agency_all = min(amount),
max_amount_agency_all = max(amount)) %>%
arrange(desc(total_amount_agency_all)) %>% ungroup()
summary(stat_agency)
## agency_name count_agency_all total_amount_agency_all
## Length:124 Min. : 1.0 Min. : 36
## Class :character 1st Qu.: 130.5 1st Qu.: 24948
## Mode :character Median : 383.0 Median : 125616
## Mean : 3568.2 Mean : 1516458
## 3rd Qu.: 1360.2 3rd Qu.: 424601
## Max. :115995.0 Max. :33778840
## avg_amount_agency_all min_amount_agency_all max_amount_agency_all
## Min. : 22.02 Min. :-42863.0 Min. : 35.6
## 1st Qu.: 192.49 1st Qu.: -2250.0 1st Qu.: 1891.0
## Median : 264.74 Median : -812.8 Median : 4661.5
## Mean : 1705.26 Mean : -3113.7 Mean : 43649.1
## 3rd Qu.: 362.34 3rd Qu.: -211.3 3rd Qu.: 12366.4
## Max. :171619.61 Max. : 437.2 Max. :1903858.4
datatable(stat_agency, width = 300)
#plotting avg_amount vs count for each agency. The bubble size represents total amount.
plot9 <- stat_agency %>% filter(avg_amount_agency_all < 6000) %>%
ggplot(aes(x = avg_amount_agency_all, y = count_agency_all,
size = total_amount_agency_all,
fill = agency_name, color = agency_name)) +
geom_jitter() + theme_minimal() +
xlab("Average Transaction Amount") + ylab("Number of Transactions") +
theme(legend.position = "none")
ggplotly(plot9)
The number of transactions by each agency ranges from 1 to 115,995 with a mean of 3,568 transactions. Similarly, the average value of a transaction ranges from 22 to 171,620. However, the mean and median are comparatively low at 1,705 and 265. The highest average transaction value is 11 standard deviations away from the mean. Therefore, for our visual examination, we will limit the average amount to 6,000.
As can be seen, Oklahoma State University is an outlier in terms of both the number of transactions and the total transaction amount. From the graph we see that the other agencies are clustered together.
#After grouping by agency, we create new variables to identify the last transaction date, the gap between each transaction and the last transaction, and the lag between two subsequent transactions.
stat_agency2 <- ccd %>% group_by(agency_name) %>%
arrange(agency_name, transaction_date) %>%
mutate(last_transaction = max(transaction_date)) %>%
mutate(gap_transaction = last_transaction - transaction_date) %>%
mutate(lag_transaction = transaction_date - lag(transaction_date)) %>%
arrange(gap_transaction) %>% ungroup()
datatable(head(stat_agency2), width = 300)
Considering agency as the primary unit, we will use the RFM method (Recency, Frequency and Monetary value) to create features. For all features we create in this category, we will consider three time periods: 1 day, 1 month and 3 months.
In a real-time system, the transaction being analyzed would be the last transaction by the agency.
* Count: number of transactions by each agency in the decided time period
* Total_amount: sum of the value across all transactions by each agency in the decided time period
* Average_amount: average value of a transaction by each agency in the decided time period
* Min_amount: minimum value of a transaction by each agency in the decided time period
* Max_amount: maximum value of a transaction by each agency in the decided time period
* Avg_lag_tnx: average lag (in difftime) between two transactions by an agency in the decided time period
* Min_lag_tnx: minimum lag (in difftime) between two transactions by an agency in the decided time period
By using count across 1 day, 1 month, 3 months and all time, we can identify a sudden increase or decrease in the number of transactions; both can point towards anomalous transactions. Combined with the lag between transactions, we can identify anomalous behaviour in the number of transactions. Since these are institutional purchasers, and given the fairly constant number of transactions each month, we expect the lag not to vary greatly between transactions.
For the value of each transaction, we create 4 features across the 3 time periods. By looking at the average, minimum and maximum amount of a transaction, we can identify anomalous behavior when a transaction value is too low or too high.
At this level of aggregation (Agency), we are creating 21 features in total.
#creating aggregate variables at the agency level and 1 day time period
stat_agency_day <- stat_agency2 %>% filter(gap_transaction == 0) %>% group_by(agency_name) %>%
dplyr::summarise(count_agency_day = n(),
total_amount_agency_day = sum(amount),
average_amount_agency_day = mean(amount),
min_amount_agency_day = min(amount),
max_amount_agency_day = max(amount),
avg_lag_tnx_day = mean(lag_transaction, na.rm = TRUE), #na.rm drops the NA lag of each agency's first transaction
min_lag_tnx_day = min(lag_transaction, na.rm = TRUE)) %>% ungroup()
#creating aggregate variables at the agency level and 1-month time period
stat_agency_m <- stat_agency2 %>% filter(gap_transaction < 31) %>% group_by(agency_name) %>%
dplyr::summarise(count_agency_m = n(),
total_amount_agency_m = sum(amount),
average_amount_agency_m = mean(amount),
min_amount_agency_m = min(amount),
max_amount_agency_m = max(amount),
avg_lag_tnx_m = mean(lag_transaction, na.rm = TRUE),
min_lag_tnx_m = min(lag_transaction, na.rm = TRUE)) %>% ungroup()
#creating aggregate variables at the agency level and 3-month time period
stat_agency_3m <- stat_agency2 %>% filter(gap_transaction < 91) %>% group_by(agency_name) %>%
dplyr::summarise(count_agency_3m = n(),
total_amount_agency_3m = sum(amount),
average_amount_agency_3m = mean(amount),
min_amount_agency_3m = min(amount),
max_amount_agency_3m = max(amount),
avg_lag_tnx_3m = mean(lag_transaction, na.rm = TRUE),
min_lag_tnx_3m = min(lag_transaction, na.rm = TRUE)) %>% ungroup()
#Using left_join to tag each agency with the appropriate aggregate variable values
stat_agency <- left_join(stat_agency, stat_agency_day, by = 'agency_name')
stat_agency <- left_join(stat_agency, stat_agency_m, by = 'agency_name')
stat_agency <- left_join(stat_agency, stat_agency_3m, by = 'agency_name')
datatable(head(stat_agency), width = 300)
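With the windowed counts joined on, one simple way to use them is the ratio of the 1-day count to the expected daily rate implied by the 1-month count. This is a hedged sketch on toy values; the column names mirror those created above, and the spike threshold of 3 is an assumption for illustration:

```r
library(dplyr)

#Toy per-agency window counts (hypothetical values, column names as above)
toy_stats <- data.frame(
  agency_name      = c("AGENCY_A", "AGENCY_B"),
  count_agency_day = c(40, 2),
  count_agency_m   = c(300, 60)
)

#Flag agencies whose 1-day volume far exceeds their typical daily rate
spike_flags <- toy_stats %>%
  mutate(expected_daily = count_agency_m / 30,
         spike_ratio = count_agency_day / expected_daily,
         spike = spike_ratio > 3)
```

Here AGENCY_A averages 10 transactions a day but made 40, so it is flagged; AGENCY_B is within its normal range.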
We will create two more levels of aggregation:
* Transactions at each vendor by each agency
* Transactions at each merchant category by each agency
By creating the same 28 variables for both these levels of aggregation, we gain further insights into the pattern of purchases for each individual agency-vendor pair and for each individual agency-merchant category pair.
Using the count of transactions across each aggregation level and each time period allows us to identify anomalous behavior in the types of purchases, for example a sudden increase in purchases at a specific vendor or merchant category. Similarly, we can use amount and lag as well.
#Collating data by Agency and merchant(vendor)
#agency x merchant for all time
stat_agency_merchant <- ccd %>% group_by(agency_name, vendor) %>%
dplyr::summarise(count_merchant_all = n(), total_amount_merchant_all = sum(amount),
avg_amount_merchant_all = mean(amount),
min_amount_merchant_all = min(amount),
max_amount_merchant_all = max(amount)) %>%
arrange(desc(total_amount_merchant_all)) %>% ungroup()
#Viewing transactions by merchant for Oklahoma State University
plot11 <- stat_agency_merchant %>%
filter(avg_amount_merchant_all < 6000
& agency_name == "OKLAHOMA STATE UNIVERSITY") %>%
ggplot(aes(x = avg_amount_merchant_all, y = count_merchant_all,
size = total_amount_merchant_all, fill = vendor, color = vendor)) +
geom_jitter() + theme_minimal() + xlab("Average Transaction Amount") +
ylab("Number of Transactions") +
ggtitle("Transactions by vendor for Oklahoma State University") +
theme(legend.position = "none")
ggplotly(plot11)
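The agency x merchant-category level can be aggregated the same way as stat_agency_merchant above. A minimal sketch on a toy frame with the same columns (values hypothetical):

```r
library(dplyr)

#Toy frame with the agency_name, MCC and amount columns used above (hypothetical)
toy <- data.frame(
  agency_name = c("A", "A", "A", "B"),
  MCC = c("GROCERY", "GROCERY", "HOTEL", "GROCERY"),
  amount = c(100, 200, 500, 50)
)

#All-time aggregates per agency x merchant-category pair
stat_agency_mcc <- toy %>%
  group_by(agency_name, MCC) %>%
  summarise(count_mcc_all = n(),
            total_amount_mcc_all = sum(amount),
            avg_amount_mcc_all = mean(amount),
            min_amount_mcc_all = min(amount),
            max_amount_mcc_all = max(amount),
            .groups = "drop") %>%
  arrange(desc(total_amount_mcc_all))
```

Substituting ccd for the toy frame yields the agency-MCC counterpart of stat_agency_merchant, after which the 1-day, 1-month and 3-month windows can be added as before.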